We have a very interesting dataset before us. Our data is one row per pixel from aerial pictures, and our goal is to identify one specific object. So before we dive into the data, let’s orient ourselves.
To start, there are some characteristics we know will affect our analysis, most of which we simply have no power to overcome at the moment.
First, we’re dealing with color pictures. There are dozens of factors that affect how an object appears in a picture and that the RGB scale will not capture, including time of day, current weather conditions, and resolution. As observers, we naturally wonder: how difficult can it be to include these pieces of information in the dataset? We will, of course, not expect any change to our dataset, but we know the science of light has much to say about color perception and photography.
Second, our observations are NOT independent! Doesn’t the information in one pixel inform us about its neighboring pixels? There is a rich amount of graphical (dare I say, a network of sorts?) information here. To treat image pixels as a random sample of independent observations is to ignore this fundamental characteristic. But for now we have no such data, so we will need to continue under an assumption of quasi-independence.
As most good unbiased scientists do, we have started off on the right foot by first identifying all our limitations. Now that this is out of the way, let’s think about our goals.
We are faced with two competing concerns: not missing any persons (false negatives) and not sending rescue workers to the wrong places (false positives). For this project, we will call false negatives “neglected persons”; these are the people we are not sending teams to at all. Our sensitivity tells us the percentage of the people out there we are actually identifying. Our false positives we will call “false pursuits”; these are trips/extractions we send our rescue team on which are fruitless. Our specificity tells us the proportion of non-tarp pixels we correctly leave alone, and the higher it is, the smaller the share of our trips that end up as false pursuits.
For each model’s threshold choice, we define our primary strategy as this: achieve the highest specificity possible. If multiple thresholds are tied at the highest specificity, achieve the highest sensitivity possible within that subgroup.
I believe this is the most ethical path for two reasons: (1) the perspectives of rescue workers and neglected persons, and (2) the need for triaging.
The Perspective of a Rescue Worker
When we are working to extract people from a natural disaster and deliver resources, our most precious resource is time. My main goal is for every single trip to be successful. I would want the highest probability possible that the first 10, 100, 1000 locations I’m being sent to actually result in time well spent. By making our specificity as high as possible, we are optimizing this goal.
The Perspective of a Neglected Person
The more heartbreaking, but equally essential, perspective is that of the additional neglected persons we leave out of our group. Here it is much more difficult to put myself into their mindset and heartset. However, I believe if I were in that place, I would rather be left waiting because other people were being rescued than because the rescue team was off on a false pursuit. Our rescue teams will likely perform trips at the same speed no matter what threshold we pick, so would we rather our first 100 trips result in 99 rescues or 80?
The Inevitability of Triaging
Finally, it is very possible we will triage locations. In selecting the highest specificity we are (1) making it very unlikely that the triaging of locations will be based on tarp-identification probability (other factors would influence where teams go first) and (2) making any later movement of our threshold needle a hopeful move: if all higher-probability sites have been attended to, we can start scouting lower-probability sites. It is always a more motivating and uplifting decision to add locations into the set than to be forced to cut them out of it.
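The threshold strategy above can be written down directly. A minimal sketch, assuming a hypothetical data frame of candidate thresholds with `specificity` and `sensitivity` columns (the shape that `yardstick::roc_curve()` returns):

```r
# Pick the threshold row with maximum specificity; break ties on sensitivity.
# `thresholds` is a hypothetical data frame with columns:
#   .threshold, specificity, sensitivity
pick_threshold <- function(thresholds) {
  best_spec <- thresholds[thresholds$specificity == max(thresholds$specificity), ]
  best_spec[which.max(best_spec$sensitivity), ]
}

# Toy example: two thresholds tie on specificity; keep the more sensitive one.
toy <- data.frame(
  .threshold  = c(0.2, 0.8, 0.9),
  specificity = c(0.90, 0.99, 0.99),
  sensitivity = c(0.95, 0.80, 0.85)
)
pick_threshold(toy)  # the row with .threshold 0.9
```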
For the ‘best’ model choice, we define our primary strategy as this: achieve the highest accuracy possible. If multiple models are tied at the highest accuracy, achieve the highest specificity possible in that subgroup.
Here our accuracy will tell us how our models compare in their ability to be correct within our optimize-specificity strategy.
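This model-selection rule mirrors the threshold rule. A minimal sketch, using a hypothetical table shaped like the model_stats frame assembled later in this report (method names and numbers here are made up):

```r
# Pick the 'best' model: highest accuracy, ties broken by specificity.
pick_model <- function(stats) {
  top <- stats[stats$Accuracy == max(stats$Accuracy), ]
  top[which.max(top$Specificity), "Method"]
}

# Toy example: A and B tie on accuracy; B wins on specificity.
toy_stats <- data.frame(
  Method      = c("A", "B", "C"),
  Accuracy    = c(0.995, 0.995, 0.982),
  Specificity = c(0.997, 1.000, 0.999)
)
pick_model(toy_stats)  # "B"
```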
Scatterplot
The first thing we’ll do is plot our data as a 3D scatterplot to get a sense of how our categories land. We see a pretty clear division between the tarp and not-tarp categories for many of our observations. Our biggest cluster of overlap is in the cube range of 90-140 for all three colors. There is a sizable chunk of tarps here, yet many rooftop and soil pixels sit all over this range as well. The observations overall follow a cone-like shape, with lower RGB values clustered closer together and higher RGB values more spread out. This is a clear sign of multicollinearity and non-constant variance, likely due to the relationship between color and light. We also run a correlation and see pretty high correlations among our three color predictors, which we know causes instability when working with coefficients.
Bubble Plot
Now we look at our data as a one-vs-rest binary classification, and we make a tallied version of our dataset. Our bubble chart shows the same 3D coordinates, except the bubble size indicates how many of our Y/N tarp-flag values land at that coordinate. The vast majority of our pixels have unique RGB values. There are a few bigger spots along the line of red = 255, green = 255, and blue = 195-205. The biggest visible bubble is at pure white (255, 255, 255) for non-tarp items.
Holdout Set
As we move to model fitting, we will use a 95/5 split, giving us 3,163 points in our final holdout set.
#bring data in and get ready
haiti_tarps <- read.csv("HaitiPixels.csv")
attach(haiti_tarps)
#identify the unique categories we have
unique(Class)
## [1] "Vegetation" "Soil" "Rooftop" "Various Non-Tarp"
## [5] "Blue Tarp"
#see if we have any missing data
which(is.na(haiti_tarps))
## integer(0)
haiti_tarps$Class <- as.factor(haiti_tarps$Class)
#make new one/rest binary variable
haiti_tarps$tarp_flag <- as.factor(ifelse(haiti_tarps$Class=="Blue Tarp","Y","N"))
library(plotly)
color_3d <- plot_ly(haiti_tarps,x=~Red, y=~Green,z=~Blue,color=~Class, colors=c('Blue','Tan','Brown','Grey','DarkGreen'), size=1)
color_3d <- color_3d %>% add_markers()
color_3d
cor(haiti_tarps[,c("Red","Blue","Green")])
## Red Blue Green
## Red 1.0000000 0.9355294 0.9815987
## Blue 0.9355294 1.0000000 0.9648024
## Green 0.9815987 0.9648024 1.0000000
#https://dplyr.tidyverse.org/reference/tally.html
#https://plotly.com/r/bubble-charts/
library(tidyverse)
library(dplyr)
color_by_count <- haiti_tarps %>% select(Red,Green,Blue,tarp_flag) %>% group_by(Red,Green,Blue,tarp_flag) %>% tally()
color_3dbubble <- plot_ly(color_by_count,x=~Red, y=~Green,z=~Blue,color=~tarp_flag, colors=c('Grey','Blue'),size=~n*2)
color_3dbubble <- color_3dbubble %>% add_markers()
color_3dbubble
#make holdout set about ~5% of data, a few thousand rows
set.seed(1)
ho_index <- sample.int(nrow(haiti_tarps),size=3163,replace=F)
#training set and holdout set
t_haiti_tarps <- haiti_tarps[-ho_index,]
ho_haiti_tarps <- haiti_tarps[ho_index,]
#for actual model training we do not need the Class variable
train_XY <- t_haiti_tarps[,!names(t_haiti_tarps) %in% c("Class")]
We choose to run KNN for k equal to every odd integer from 1 to 99. After our model training, we have a best tuning choice of k = 5, for which our accuracy is 0.997 and our AUC is 0.998. Once our k value is chosen, our entire model is chosen. We plot the 10 ROC curves for the 10 folds from k = 5 along with the ROC curve of k = 5 applied to the entire training dataset. Our 10 folds follow each other very closely, giving a good sense of stability to our k choice.
We use the roc_curve() function to automatically identify the thresholds that result in distinct specificity/sensitivity values. Our maximum specificity is 0.997 at threshold 1, with an associated sensitivity of 0.992. We create a confusion matrix at this threshold. We also examine the opposing strategy, looking at the lowest non-zero threshold (maximizing sensitivity).
So in the case of our lowest threshold, 0.0769, we have 269 false pursuits and we neglect 29 persons/groups. A trip has an 87.5% probability of actually finding persons, and we make rescue trips to 98.4% of all tarps.
And in the case of our highest threshold, 1, we have 24 false pursuits and we neglect 258 persons/groups. A trip has a 98.6% probability of actually finding persons, and we make rescue trips to 86.5% of all tarps.
Here I choose a threshold of 1. We give our rescue teams a 98.6% probability that their trip is well spent, and we neglect 258 persons/groups. If we instead used a threshold of 0.0769 and those 258 people/groups were included in our pot, we would go from a 98.6% chance of a successful rescue-team extraction down to only an 87.5% chance. Choosing this threshold reduces our accuracy to 0.995.
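As an arithmetic check, these percentages fall straight out of the confusion-matrix counts reported for the two thresholds; a minimal sketch:

```r
# Trip-success probability is the precision (PPV); rescue coverage is sensitivity.
ppv  <- function(good, false_pursuits) good / (good + false_pursuits)
sens <- function(good, neglected)      good / (good + neglected)

# Lowest threshold (0.0769): 1884 good pursuits, 269 false pursuits, 29 neglected
round(ppv(1884, 269), 4)   # 0.8751
round(sens(1884, 29), 4)   # 0.9848

# Highest threshold (1): 1655 good pursuits, 24 false pursuits, 258 neglected
round(ppv(1655, 24), 4)    # 0.9857
round(sens(1655, 258), 4)  # 0.8651
```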
This Looks…Curious
We plot our predicted probabilities against the true classifications and spot something interesting. While we see a good division for the majority of our observations, there is an intriguing spot between probabilities just above 0 and below 0.11. A good number of tarps and non-tarps are clustered here. When we dig into these observations, we see their RGB values are almost all pure white! In our training dataset we have a few hundred observations in this cluster. Here we see a shortcoming of our predictive power: there are clearly some situations that cause our tarps to show up as (255, 255, 255) RGB pixels, which are indistinguishable from non-tarp objects.
#train KNN model for best K neighbors value
library(caret)
set.seed(1)
train_accy_Ks <- train(tarp_flag~., data = train_XY, method = "knn", tuneGrid = data.frame(k=seq(1,100,2)), trControl = caret::trainControl("cv", number=10, returnResamp='all', savePredictions='final',classProbs=TRUE,allowParallel = TRUE))
#save the tuned k neighbor choice
kchoice <- train_accy_Ks$bestTune[1,1]
#save the training data accuracy value
knn_trainaccuracy <- filter(train_accy_Ks$results,k == train_accy_Ks$bestTune[1,1])$Accuracy
#save the AUC value
knn_auc <- train_accy_Ks$pred %>% yardstick::roc_auc(truth=obs,N)
#once we have our model, let's look at how our accuracy does, print out our best neighbor value in the subtitle
train_accy_Ks %>% ggplot(aes(x=seq_along(Accuracy), y=Accuracy)) + ggtitle("Overall Accuracy by K Neighbor choice", subtitle=sprintf("Best K Tuning Value %.0f",train_accy_Ks$bestTune[1]))
library(tidyverse)
#make a layered ROC curve for each of the folds
knnfolds_ROC_curve <- train_accy_Ks$pred %>% group_by(Resample)%>% yardstick::roc_curve(truth=obs, estimate=N) %>% autoplot() + ggtitle(sprintf("ROC Curves by KNN K choice %.0f",kchoice))
#make an ROC curve for the best tuned model, include accuracy and AUC
knn_ROC_curve <- train_accy_Ks$pred %>% yardstick::roc_curve(truth=obs, estimate=N) %>% autoplot() + ggtitle(sprintf("KNN ROC Curve -- K choice %.0f",kchoice),subtitle=sprintf("Accuracy %.6f -- AUC %.6f",knn_trainaccuracy, knn_auc[1,3,1]))
#print out our ROC curves side by side
library(gridExtra)
grid.arrange(knnfolds_ROC_curve,knn_ROC_curve,ncol=2,widths=(8:7))
library(tidyverse)
#print out top and bottom of threshold table
threshold_table <- train_accy_Ks$pred %>% yardstick::roc_curve(truth=obs, estimate=N)
head(threshold_table)
tail(threshold_table)
library(caret)
#here we want to focus on thinking about our threshold choice; make confusion matrices for our strategy and the opposing strategy's threshold choice.
knn_highstvty_cm <- confusionMatrix(as.factor(ifelse(train_accy_Ks$pred$Y >= 0.0769 ,"Y","N")), train_accy_Ks$pred$obs)$table
knn_highspcty_cm <- confusionMatrix(as.factor(ifelse(train_accy_Ks$pred$Y >= 1,"Y","N")), train_accy_Ks$pred$obs)$table
knn_highstvty_cm
## Reference
## Prediction N Y
## N 57896 29
## Y 269 1884
knn_highspcty_cm
## Reference
## Prediction N Y
## N 58141 258
## Y 24 1655
#false discovery rate
knn_fdr <- knn_highspcty_cm[2,1]/(knn_highspcty_cm[2,1] + knn_highspcty_cm[2,2])
#precision
knn_ppv <- knn_highspcty_cm[2,2]/(knn_highspcty_cm[2,1] + knn_highspcty_cm[2,2])
#recalculate accuracy with chosen threshold
knn_accy <- (knn_highspcty_cm[1,1] + knn_highspcty_cm[2,2])/(knn_highspcty_cm[1,1] + knn_highspcty_cm[2,2] + (knn_highspcty_cm[1,2] + knn_highspcty_cm[2,1]))
#store the important statistics from this analysis
model_stats <- data.frame(Method = sprintf("KNN k=%.0f",kchoice), Accuracy = knn_accy, AUC = knn_auc[1,3,1], Threshold = 1, Sensitivity = 0.9916284, Specificity = 0.9968504 ,FDR = knn_fdr,PPV = knn_ppv, False_Pursuits = knn_highspcty_cm[2,1], Neglected_Persons = knn_highspcty_cm[1,2],Good_Pursuits = knn_highspcty_cm[2,2])
library(tidyverse)
library(ggplot2)
#make a plot with our predicted probabilities along x, separated out by true classification.
train_accy_Ks$pred %>% ggplot(aes(x=Y, y=obs)) + geom_count() + labs(x="KNN Predicted probability of being tarp",y="True Classification of Tarp/Not Tarp") + ggtitle("Probability Distribution by True Classification")
#what are the RGB values for the cluster around 0-.11 where N and Y seem indiscernible?
knnprobabilitycluster <- t_haiti_tarps[filter(train_accy_Ks$pred, Y > 0 & Y < .11)$rowIndex,c("Red","Blue","Green","tarp_flag")]
knnprobabilitycluster
We perform 10-fold cross-validation with LDA. Our accuracy plot shows the highest accuracy is 0.986, on fold 6. Our resulting AUC is 0.989. We plot our ROC curves as we did for KNN. There is less consensus among the curves, with bigger variation between folds in the top-left corner. In our CV, the best model was chosen, and we apply it to the entire training data for the ROC curve on the right-hand side.
In choosing a threshold we take the same approach as we did for KNN. Since our LDA method has roughly 47k thresholds, we have thousands of records where our specificity is maxed out and thousands more where sensitivity is maxed out. So we peek at the confusion matrices of our two opposing optimizations: (1) maximize specificity and within that subset maximize sensitivity, and (2) maximize sensitivity and within that subset maximize specificity.
We see here that optimizing sensitivity is more troubling than it was for KNN. There are 2383 false pursuits (and we know our full dataset has only 2022 actual tarp pixels). In our optimization of specificity, however, we identify only 825 tarps. We have 0 false pursuits here, giving our rescue team extremely high chances of having only successful missions, but we are neglecting 1088 persons. So even in our optimal model, we can already label this as subpar to KNN. We choose a threshold of 0.99996 for LDA on our highest-accuracy model. This reduces our accuracy to 0.982.
In our plot of predicted probabilities versus the true classification, we can see the division is much less clear than it was for KNN. The same phenomenon is not happening with pure-white pixels, but the pattern we saw there is almost the entire story of what we see here with LDA: a very even spread of probabilities for both Y and N values. The big bubble of non-tarps very close to 1 is the point at which our specificity drops below 1, so we can see that we chose a threshold value greater than this bubble’s probability value and less than the next Y bubble’s value.
#10 fold cross validation for lda
set.seed(1)
train_accy_lda <- caret::train(tarp_flag~., data = train_XY, method = "lda", tuneGrid=data.frame(.parameter = seq(1,1,1)), trControl = caret::trainControl("cv", number=10, returnResamp='all', savePredictions='final',classProbs=TRUE,allowParallel = TRUE))
lda_train_accy <- max(train_accy_lda$resample$Accuracy)
lda_auc <- train_accy_lda$pred %>% yardstick::roc_auc(truth=obs,N)
#once we have our model, let's look at how our accuracy does, print out accuracy in subtitle
train_accy_lda$resample %>% ggplot(aes(x=Resample, y=Accuracy)) + geom_point(aes(x=Resample, y=Accuracy)) + ggtitle("Overall Accuracy by LDA Fold", subtitle=sprintf("Highest Accuracy %.6f",max(train_accy_lda$resample$Accuracy)))
library(tidyverse)
ldafolds_ROC_curve <- train_accy_lda$pred %>% group_by(Resample)%>% yardstick::roc_curve(truth=obs, estimate=N) %>% autoplot() + ggtitle(sprintf("ROC Curves by LDA folds"))
lda_ROC_curve <- train_accy_lda$pred %>% yardstick::roc_curve(truth=obs, estimate=N) %>% autoplot() + ggtitle("LDA ROC Curve",subtitle=sprintf("All Training data on highest accuracy model -- AUC %.6f",lda_auc[1,3,1]))
library(gridExtra)
grid.arrange(ldafolds_ROC_curve,lda_ROC_curve,ncol=2,widths=(8:7))
lda_thresholds <- train_accy_lda$pred %>% yardstick::roc_curve(truth=obs, estimate=N) %>% filter(.threshold != "Inf" & .threshold != "-Inf")
#find maximum specificity and within that subset find maximum sensitivity
max_spcty <- max(lda_thresholds$specificity)
max_spcty_submaxsnty <- max(lda_thresholds[lda_thresholds$specificity==max_spcty,]$sensitivity)
optimal_spcty_threshold <- filter(lda_thresholds,specificity==max_spcty, sensitivity == max_spcty_submaxsnty)$.threshold
#find maximum sensitivity and within that subset find maximum specificity
max_snty <- max(lda_thresholds$sensitivity)
max_snty_submaxspcty <- max(lda_thresholds[lda_thresholds$sensitivity==max_snty,]$specificity)
optimal_snty_threshold <- filter(lda_thresholds,specificity==max_snty_submaxspcty, sensitivity == max_snty)$.threshold
library(caret)
lda_highstnty_cm <- confusionMatrix(as.factor(ifelse(train_accy_lda$pred$Y >= optimal_snty_threshold,"Y","N")), train_accy_lda$pred$obs)$table
lda_highspcty_cm <- confusionMatrix(as.factor(ifelse(train_accy_lda$pred$Y >= optimal_spcty_threshold,"Y","N")), train_accy_lda$pred$obs)$table
lda_highstnty_cm
## Reference
## Prediction N Y
## N 55782 172
## Y 2383 1741
lda_highspcty_cm
## Reference
## Prediction N Y
## N 58165 1088
## Y 0 825
#false discovery rate
lda_fdr <- lda_highspcty_cm[2,1]/(lda_highspcty_cm[2,1] + lda_highspcty_cm[2,2])
#precision
lda_ppv <- lda_highspcty_cm[2,2]/(lda_highspcty_cm[2,1] + lda_highspcty_cm[2,2])
#recalculate accuracy with chosen threshold
lda_accy <- (lda_highspcty_cm[2,2] + lda_highspcty_cm[1,1])/(lda_highspcty_cm[1,2] + lda_highspcty_cm[2,1] + lda_highspcty_cm[2,2] + lda_highspcty_cm[1,1])
#store all the important stats from this model analysis
model_stats <- data.frame(rbind(model_stats, cbind(Method = "LDA", Accuracy = lda_accy, AUC = lda_auc[1,3,1], Threshold = optimal_spcty_threshold, Sensitivity = max_spcty_submaxsnty, Specificity = max_spcty ,FDR = lda_fdr,PPV = lda_ppv, False_Pursuits = lda_highspcty_cm[2,1], Neglected_Persons = lda_highspcty_cm[1,2], Good_Pursuits = lda_highspcty_cm[2,2])))
library(tidyverse)
library(ggplot2)
ldaprobs <- train_accy_lda$pred %>% ggplot(aes(x=Y, y=obs)) + geom_count() + labs(x="LDA Predicted probability of being tarp",y="True Classification of Tarp/Not Tarp") + ggtitle("Probability Distribution by True Classification")
ldaprobs
#note the model summary
train_accy_lda$finalModel$means
## Red Green Blue
## N 162.706 152.5277 122.4408
## Y 169.574 186.3413 204.9472
When we run QDA there is a slight improvement overall from LDA: the highest accuracy is 0.995, on fold 6, the AUC is 0.998189, and the ROC curves show less variation. We also like the comparison of the predicted-probability plots, with QDA showing better separation than LDA.
Similarly to LDA, our QDA has ~43k unique thresholds. So we peek at the confusion matrices of our two opposing cases: (1) maximize specificity and within that subset maximize sensitivity and (2) maximize sensitivity and within that subset maximize specificity.
We like the results here more than LDA’s, but they still don’t compete with KNN’s. Our high-sensitivity model has only 446 false pursuits, but our optimization of specificity bumps us up only to 876 identified tarps. We still have 0 false pursuits, but we are still neglecting 1037 persons. While this shows an ever-so-small improvement over LDA, it is still a more concerning situation than the KNN numbers.
We choose a threshold of 0.99998, giving us a specificity of 1 and a sensitivity of 0.4053. This reduces our accuracy to 0.98189.
#10 fold cross validation for qda
set.seed(1)
train_accy_qda <- caret::train(tarp_flag~., data = train_XY, method = "qda", tuneGrid=data.frame(.parameter = seq(1,1,1)), trControl = caret::trainControl("cv", number=10, returnResamp='all', savePredictions='final',classProbs=TRUE,allowParallel = TRUE))
qda_train_accy <- max(train_accy_qda$resample$Accuracy)
qda_auc <- train_accy_qda$pred %>% yardstick::roc_auc(truth=obs,N)
#once we have our model, let's look at how our accuracy does, print out accuracy in subtitle
train_accy_qda$resample %>% ggplot(aes(x=Resample, y=Accuracy)) + geom_point(aes(x=Resample, y=Accuracy)) + ggtitle("Overall Accuracy by QDA Fold", subtitle=sprintf("Highest Accuracy %.6f",max(train_accy_qda$resample$Accuracy)))
library(tidyverse)
qdafolds_ROC_curve <- train_accy_qda$pred %>% group_by(Resample)%>% yardstick::roc_curve(truth=obs, estimate=N) %>% autoplot() + ggtitle(sprintf("ROC Curves by QDA folds"))
qda_ROC_curve <- train_accy_qda$pred %>% yardstick::roc_curve(truth=obs, estimate=N) %>% autoplot() + ggtitle("QDA ROC Curve",subtitle=sprintf("All Training data on highest accuracy model -- AUC %.6f",qda_auc[1,3,1]))
library(gridExtra)
grid.arrange(qdafolds_ROC_curve,qda_ROC_curve,ncol=2,widths=(8:7))
qda_thresholds <- train_accy_qda$pred %>% yardstick::roc_curve(truth=obs, estimate=N) %>% filter(.threshold != "Inf" & .threshold != "-Inf")
#find maximum specificity and within that subset find maximum sensitivity
max_spcty <- max(qda_thresholds$specificity)
max_spcty_submaxsnty <- max(qda_thresholds[qda_thresholds$specificity==max_spcty,]$sensitivity)
optimal_spcty_threshold <- filter(qda_thresholds,specificity==max_spcty, sensitivity == max_spcty_submaxsnty)$.threshold
#find maximum sensitivity and within that subset find maximum specificity
max_snty <- max(qda_thresholds$sensitivity)
max_snty_submaxspcty <- max(qda_thresholds[qda_thresholds$sensitivity==max_snty,]$specificity)
optimal_snty_threshold <- filter(qda_thresholds,specificity==max_snty_submaxspcty, sensitivity == max_snty)$.threshold
library(caret)
qda_highstnty_cm <- confusionMatrix(as.factor(ifelse(train_accy_qda$pred$Y >= optimal_snty_threshold,"Y","N")), train_accy_qda$pred$obs)$table
qda_highspcty_cm <- confusionMatrix(as.factor(ifelse(train_accy_qda$pred$Y >= optimal_spcty_threshold,"Y","N")), train_accy_qda$pred$obs)$table
qda_highstnty_cm
## Reference
## Prediction N Y
## N 57719 182
## Y 446 1731
qda_highspcty_cm
## Reference
## Prediction N Y
## N 58165 1037
## Y 0 876
#false discovery rate
qda_fdr <- qda_highspcty_cm[2,1]/(qda_highspcty_cm[2,1] + qda_highspcty_cm[2,2])
#precision
qda_ppv <- qda_highspcty_cm[2,2]/(qda_highspcty_cm[2,1] + qda_highspcty_cm[2,2])
#recalculate accuracy with chosen threshold
qda_accy <- (qda_highspcty_cm[1,1] + qda_highspcty_cm[2,2]) / (qda_highspcty_cm[1,1] + qda_highspcty_cm[2,2] + qda_highspcty_cm[2,1] + qda_highspcty_cm[1,2])
#store all the important stats from this model analysis
model_stats <- data.frame(rbind(model_stats, cbind(Method = "QDA", Accuracy = qda_accy, AUC = qda_auc[1,3,1], Threshold = optimal_spcty_threshold, Sensitivity = max_spcty_submaxsnty, Specificity = max_spcty ,FDR = qda_fdr,PPV = qda_ppv, False_Pursuits = qda_highspcty_cm[2,1], Neglected_Persons = qda_highspcty_cm[1,2],Good_Pursuits = qda_highspcty_cm[2,2])))
library(tidyverse)
library(ggplot2)
qdaprobs <- train_accy_qda$pred %>% ggplot(aes(x=Y, y=obs)) + geom_count() + labs(x="QDA Predicted probability of being tarp",y="True Classification of Tarp/Not Tarp") + ggtitle("Probability Distribution by True Classification")
qdaprobs
#note the QDA model summary
train_accy_qda$finalModel$means
## Red Green Blue
## N 162.706 152.5277 122.4408
## Y 169.574 186.3413 204.9472
Finally, for logistic regression, we have a highest accuracy of 0.996338, on fold 9. The AUC is 0.998. The ROC curves show more consistency than LDA’s, but slightly less than QDA’s.
We see some clear improvement here over LDA and QDA. We maintain our 0 false pursuits, but we raise our number of correctly identified tarps to 1190, well above LDA’s 825 and QDA’s 876! Our neglected-persons count is 723. We see that if our strategy had been reversed, this method would give us pretty bad odds for rescue-team trips, with 6043 false pursuits! There are only 4 neglected persons there. So logistic regression does appear to have a leg up on both QDA and LDA; our predicted-probability plot shows this clearer division.
We choose a threshold of 0.99916, giving us a reduced training-data accuracy of 0.987965, with a specificity of 1 and a sensitivity of 0.83.
#10 fold cross validation for logistic regression
#https://daviddalpiaz.github.io/r4sl/the-caret-package.html#classification i had no idea how to call logistic regression with caret, ha
set.seed(1)
train_accy_lr <- caret::train(tarp_flag~., data = train_XY, method = "glm", family="binomial", tuneGrid=data.frame(.parameter = seq(1,1,1)), trControl = caret::trainControl("cv", number=10, returnResamp='all', savePredictions='final',classProbs=TRUE,allowParallel = TRUE))
lr_train_accy <- max(train_accy_lr$resample$Accuracy)
lr_auc <- train_accy_lr$pred %>% yardstick::roc_auc(truth=obs,N) %>% select(.estimate)
#once we have our model, let's look at how our accuracy does, print out highest accuracy in subtitle
train_accy_lr$resample %>% ggplot(aes(x=Resample, y=Accuracy)) + geom_point(aes(x=Resample, y=Accuracy)) + ggtitle("Overall Accuracy by LogReg Fold", subtitle=sprintf("Highest Accuracy %.6f",max(train_accy_lr$resample$Accuracy)))
library(tidyverse)
lrfolds_ROC_curve <- train_accy_lr$pred %>% group_by(Resample)%>% yardstick::roc_curve(truth=obs, estimate=N) %>% autoplot() + ggtitle(sprintf("ROC Curves by LogReg folds"))
lr_ROC_curve <- train_accy_lr$pred %>% yardstick::roc_curve(truth=obs, estimate=N) %>% autoplot() + ggtitle("LogReg ROC Curve",subtitle=sprintf("All Training data on highest accuracy model -- AUC %.6f",lr_auc[1,1,1]))
library(gridExtra)
grid.arrange(lrfolds_ROC_curve,lr_ROC_curve,ncol=2,widths=(8:7))
lr_thresholds <- train_accy_lr$pred %>% yardstick::roc_curve(truth=obs, estimate=N) %>% filter(.threshold != "Inf" & .threshold != "-Inf")
#find maximum specificity and within that subset find maximum sensitivity
max_spcty <- max(lr_thresholds$specificity)
max_spcty_submaxsnty <- max(lr_thresholds[lr_thresholds$specificity==max_spcty,]$sensitivity)
optimal_spcty_threshold <- filter(lr_thresholds,specificity==max_spcty, sensitivity == max_spcty_submaxsnty)$.threshold
#find maximum sensitivity and within that subset find maximum specificity
max_snty <- max(lr_thresholds$sensitivity)
max_snty_submaxspcty <- max(lr_thresholds[lr_thresholds$sensitivity==max_snty,]$specificity)
optimal_snty_threshold <- filter(lr_thresholds,specificity==max_snty_submaxspcty, sensitivity == max_snty)$.threshold
library(caret)
lr_highstnty_cm <- confusionMatrix(as.factor(ifelse(train_accy_lr$pred$Y >= optimal_snty_threshold,"Y","N")), train_accy_lr$pred$obs)$table
lr_highspcty_cm <- confusionMatrix(as.factor(ifelse(train_accy_lr$pred$Y >= optimal_spcty_threshold,"Y","N")), train_accy_lr$pred$obs)$table
lr_highstnty_cm
## Reference
## Prediction N Y
## N 52122 4
## Y 6043 1909
lr_highspcty_cm
## Reference
## Prediction N Y
## N 58165 723
## Y 0 1190
#false discovery rate
lr_fdr <- lr_highspcty_cm[2,1]/(lr_highspcty_cm[2,1] + lr_highspcty_cm[2,2])
#precision
lr_ppv <- lr_highspcty_cm[2,2]/(lr_highspcty_cm[2,1] + lr_highspcty_cm[2,2])
#recalculate accuracy with chosen threshold
lr_accy <- (lr_highspcty_cm[2,2] + lr_highspcty_cm[1,1])/(lr_highspcty_cm[2,2] + lr_highspcty_cm[1,1] + lr_highspcty_cm[1,2] + lr_highspcty_cm[2,1])
#store all the important stats from this model analysis
model_stats <- data.frame(rbind(model_stats, cbind(Method = "LogReg", Accuracy = lr_accy, AUC = lr_auc[1,1,1], Threshold = optimal_spcty_threshold, Sensitivity = max_spcty_submaxsnty, Specificity = max_spcty ,FDR = lr_fdr,PPV = lr_ppv, False_Pursuits = lr_highspcty_cm[2,1], Neglected_Persons = lr_highspcty_cm[1,2],Good_Pursuits=lr_highspcty_cm[2,2])))
library(tidyverse)
library(ggplot2)
lrprobs <- train_accy_lr$pred %>% ggplot(aes(x=Y, y=obs)) + geom_count() + labs(x="LogReg Predicted probability of being tarp",y="True Classification of Tarp/Not Tarp") + ggtitle("Probability Distribution by True Classification")
lrprobs
#note the model summary
summary(train_accy_lr$finalModel)
##
## Call:
## NULL
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.4060 -0.0210 -0.0016 0.0000 3.7775
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.17539 0.18770 0.934 0.35
## Red -0.25621 0.01272 -20.150 <2e-16 ***
## Green -0.21850 0.01351 -16.176 <2e-16 ***
## Blue 0.46861 0.01565 29.936 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 16953 on 60077 degrees of freedom
## Residual deviance: 1690 on 60074 degrees of freedom
## AIC: 1698
##
## Number of Fisher Scoring iterations: 12
So now, with all our models trained and all our statistics in hand, we need to make a definitive conclusion. Here we print out information for each model according to how it performs against the entire training dataset. This serves as our performance-metrics table for 10-fold CV in Part 1 of the project.
As we identified earlier, we will use accuracy to determine the best model here, which we see is KNN at 99.53% accuracy. I would also note we are okay with KNN as the optimal choice here for two reasons:
We have the best number of good pursuits here with 1655 out of our training data’s total of 1913 tarps. This gives us by far the best sensitivity.
The ROC curves for our 10 folds were very much in alignment, more so than for any other method (though QDA gave it a run for its money). This gives us confidence in how our model will perform on the holdout set in Part 2. It is a good sign of a stable model.
We are happy with KNN in this approach, but we know there is a caveat: KNN must keep the full training dataset on hand and search it for the closest neighbors at every single prediction. It may well be that we need a faster model such as logistic regression, where we can take the coefficients and threshold and use them for future predictions and decision-making without relying on the original dataset.
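To illustrate that portability, a pixel can be scored from the coefficients alone. A minimal sketch, with the coefficients copied from the glm summary above (illustrative only; a production version would pull them from the saved model object):

```r
# Score a pixel with the fitted logistic-regression coefficients alone;
# no stored training data is needed, unlike KNN.
# Coefficients are copied from the glm summary reported in this analysis.
predict_tarp_prob <- function(red, green, blue) {
  eta <- 0.17539 - 0.25621 * red - 0.21850 * green + 0.46861 * blue
  1 / (1 + exp(-eta))  # inverse logit
}

predict_tarp_prob(100, 120, 230)  # blue-heavy pixel: probability near 1
predict_tarp_prob(200, 150, 100)  # red-heavy pixel: probability near 0
```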
Logistic regression would be the second choice; it showed good predictive power even without a thorough treatment of the multicollinearity and non-constant variance of our predictors. A transformation that retained all of the information would allow us to investigate the potential of logistic regression further. QDA had a more stable ROC curve, but it had the worst sensitivity by far.
Finally, I believe it may be an interesting pursuit to work through the data either (1) in a staging process or (2) by utilizing splines. Since we know the classifications of every pixel, we might find something fruitful in cutting through our data in stages. We would find which objects are most distinct from tarps, identify them, and remove them from our set. Then, within that smaller set, we would identify a different non-tarp object, with the ability to use a different model in this stage than before. It is possible that different objects are more discernible from tarps with different methods. It may also be fruitful to utilize splines to split our data into different sections, based on variance or overall luminance. We noted the lower-light cluster in our EDA where objects blended into tarps more, and we saw how indiscernible the white pixels were in our KNN investigation.
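A minimal sketch of the staging idea, under toy assumptions: synthetic pixels, a made-up "vegetation" stage-one class, and logistic models at both stages (in practice each stage could use a different method, and real classes would replace these rules):

```r
set.seed(1)
# Synthetic pixels: a green-dominant "vegetation" class and a blue-dominant tarp class.
px <- data.frame(Red   = runif(300, 0, 255),
                 Green = runif(300, 0, 255),
                 Blue  = runif(300, 0, 255))
px$veg  <- as.integer(px$Green > 200 & px$Blue < 100)
px$tarp <- as.integer(px$Blue > 200 & px$veg == 0)

# Stage 1: screen out the class most distinct from tarps (here, "vegetation").
m1   <- glm(veg ~ Red + Green + Blue, data = px, family = binomial)
keep <- predict(m1, px, type = "response") < 0.99

# Stage 2: fit a second model on the harder remainder only.
m2 <- glm(tarp ~ Red + Green + Blue, data = px[keep, ], family = binomial)

# Screened-out pixels are labeled non-tarp; the rest get stage-2 predictions.
pred <- rep("N", nrow(px))
pred[keep] <- ifelse(predict(m2, px[keep, ], type = "response") >= 0.5, "Y", "N")
table(pred, px$tarp)
```

The appeal is that each stage faces a simpler decision boundary than one model facing all five classes at once.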
How much will this model help?
It’s difficult to tell for sure. I believe this predictive modeling would save lives if applied, but its true power is yet to be tested on our holdout set. We shall have to wait for Part 2 for a clear answer to this question.
#round our statistics for better viewing
model_stats$Accuracy <- round(as.numeric(model_stats$Accuracy),6)
model_stats$AUC <- round(as.numeric(model_stats$AUC),6)
model_stats$Threshold <- round(as.numeric(model_stats$Threshold),6)
model_stats$Sensitivity <- round(as.numeric(model_stats$Sensitivity),6)
model_stats$Specificity <- round(as.numeric(model_stats$Specificity),6)
model_stats$FDR <- round(as.numeric(model_stats$FDR),6)
model_stats$PPV <- round(as.numeric(model_stats$PPV),6)
model_stats